130 research outputs found

Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

Author: De Clercq Orphée
Heuvel Henk van den
Jong Franciska de
Oostdijk Nelleke
Reynaert Martin
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2010

In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Besides its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts (SMS, chat) the problem is even more complex: issues such as anonymity, recognizability and citation right all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and a smaller, more privacy-compliant SoNaR of the commissioned size, IPR-cleared for commercial purposes as well.

CiteSeerX

Ghent University Academic Bibliography

Radboud Repository

University of Twente Research Information

Tilburg University Repository

The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch

Author: A Bosch Van den
A Braasch
C Rijsbergen Van
G Aston
J Leveling
J Trapman
JC Carletta
M Recasens
M Reynaert
Martin W. C. Reynaert
W Daelemans
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Crossref

Springer - Publisher Connector

Tilburg University Repository

Character confusion versus focus word-based correction of spelling and OCR variants in corpora

Author: C. Ringlstetter
C.J. Rijsbergen van
D. Lopresti
F.J. Damerau
G. Navarro
G.K. Zipf
K. Kukich
Martin W. C. Reynaert
U. Frauenfelder
W.J. Teahan
Publication venue: 'Springer Science and Business Media LLC'

Crossref

QUINE corpus in Autosearch

Author: Reynaert Martin
Publication venue: INT - Instituut voor de Nederlandse Taal
Publication date: 01/01/2021

The QUINE corpus (version 0.5) consists of virtually all of Quine’s 228 books and articles, containing in total 819 documents (books are split into parts), 2,150,356 word tokens, 38,791 word types and 27,837 lemmatized word types. It includes texts in various genres and from different phases of Quine’s thought on various topics, including technical and formula-heavy writings on logic and the foundations of mathematics. The corpus exhibits a high degree of lexical variation and many instances of fine-grained meaning distinctions.

Tilburg University Repository

KANT corpus in Autosearch

Author: Reynaert Martin
Publication venue: INT - Instituut voor de Nederlandse Taal
Publication date: 01/01/2021

Works 1-12 of the 'Gesammelten Werke' of the philosopher Immanuel Kant are online, available to researchers by invitation, in a corpus exploration and exploitation interface based on WhiteLab by virtue of Autosearch, a CLARIN service provided by INT, the Institute for the Dutch Language.

OCR Post-Correction Evaluation of Early Dutch Books Online - Revisited

Author: Reynaert Martin
Publication venue: ELRA
Publication date: 01/01/2016

Tenth International Conference on Language Resources and Evaluation (LREC 2016). Full text: 162481.pdf (publisher's version, Open Access).

Radboud Repository

Tilburg University Repository

Non-interactive OCR post-correction for giga-scale digitization projects

Author: Martin Reynaert
Publication date: 01/01/2008

This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up, or TICCL (pronounced 'tickle'), focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within a predefined Levenshtein distance (henceforth: LD). Simple text-induced filtering techniques help to retain as many of the true positives as possible and to discard as many of the false positives as possible. TICCL has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR quality is far lower and which is in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to properly evaluate our system, but also to draw effective conclusions towards the adaptation of the adopted correction mechanism to OCR-error resolution. The performance scores obtained up to LD 2 mean that the bulk of the undesirable OCR-induced typographical variation present can be removed fully automatically.
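
The core idea in the abstract above (pick high-frequency focus words from the corpus itself, then gather lower-frequency word forms within a small Levenshtein distance) can be illustrated with a minimal sketch. This is not the system's actual implementation: the brute-force pairwise comparison, the function names, the frequency threshold `min_focus_freq` and the "variant must be rarer than the focus word" filter are all assumptions made for illustration, and a giga-scale system would need a far more efficient lookup.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2):
    """Map each frequent focus word to rarer word forms within max_ld edits.

    min_focus_freq and the 'rarer than the focus word' filter are
    hypothetical stand-ins for the text-induced filtering the abstract
    mentions but does not specify.
    """
    counts = Counter(tokens)
    focus_words = [w for w, c in counts.items() if c >= min_focus_freq]
    variants = {}
    for focus in focus_words:
        close = sorted(
            w for w, c in counts.items()
            if w != focus
            and c < counts[focus]
            and levenshtein(w, focus) <= max_ld
        )
        if close:
            variants[focus] = close
    return variants


# Toy corpus: two frequent Dutch words plus a few OCR-style misreadings.
tokens = (["regering"] * 5
          + ["regcring", "rcgering", "regeering"]
          + ["minister"] * 3
          + ["ministcr"])
result = gather_variants(tokens)
for focus, forms in result.items():
    print(focus, "<-", forms)
```

Because the focus words come from the corpus being cleaned, no external lexicon is needed in this sketch; the price of the brute-force comparison is quadratic time in the number of word types.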

CiteSeerX

Tilburg University Repository
